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Abstract 

Compliant control is a standard method for performing fine manipulation tasks, like grasping and assembly, 
but it requires estimation of the state of contact between the robot arm and the objects involved. Here 
we present a method to learn a model of the movement from measured data. The method requires little 
or no prior knowledge and the resulting model explicitly estimates the state of contact . The current state 
of contact is viewed as the hidden state variable of a discrete HMM. The control dependent transition 
probabilities between states are modeled as parametrized functions of the measurement. We show that 
their parameters can be estimated from measurements concurrently with the estimation of the parameters 
of the movement in each state of contact . The learning algorithm is a variant of the EM procedure. The 
E step is computed exactly; solving the M step exactly would require solving a set of coupled nonlinear 
algebraic equations in the parameters. Instead, gradient ascent is used to produce an increase in likelihood. 
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1 Introduction 

For a large class of robotics tasks, such as assembly tasks 
or manipulation of relatively light-weight objects, under 
appropriate damping of the manipulator the dynamics of 
the objects can be neglected. For these tasks the main 
difficulty is in having the robot achieve its goal despite 
uncertainty in its position relative to the surrounding 
objects. Uncertainty is due to inaccurate knowledge of 
the geometric shapes and positions of the objects and of 
their physical properties (surface friction coefficients), or 
to positioning errors in the manipulator. The standard 
solution to this problem is controlled compliance [9]. Un- 
der compliant motion, the task is performed in stages. 
In each stage the robot arm maintains contact with a se- 
lected surface or feature of the environment. The stage 
ends when contact with the feature corresponding to the 
next stage is made. The goal itself is usually defined as 
satisfaction of a set of contact constraints rather than 
fixed position and force references. 

Decomposing the given task into subtasks and speci- 
fying each goal or subgoal in terms of contact constraints 
has proven to be a particularly fertile idea, from which 
a fair number of approaches have evolved. But each of 
them have to face and solve the problem of estimating 
the state of contact (i.e. checking if the contact with the 
correct surface is achieved), a direct consequence of deal- 
ing with noisy measurements. The earliest approaches 
such as guarded moves [2] and compliance control [9] are 
problem dependent and deal only indirectly with uncer- 
tainty . Preimage fine motion planning [8] directly in- 
corporates uncertainty and provides a formal problem 
statement and approach, but becomes computationally 
intractable in the case of noisy measurements. Attempt- 
ing to overcome this last difficulty are the proposals of 
[10] local control around a nominal path (LCNP) and 
[5] feature based programming. The latter treats the 
uncertanty in position and velocity in a probabilistic 
framework. It associates to each state of contact the 
coresponding movement model] that is: a relationship 
between positions and nominal 1 and actual velocities 
that holds over a domain of the position-nominal veloc- 
ity space. Then it constructs an observer which uses 
a sequence of recent measurements in order to decide 
which is the current movement model. All the above ap- 
proaches assume prior geometrical and physical knowl- 
edge of the environment. 

In this paper we present a method to learn a model of 
the environment which will serve to estimate the state of 
contact and to predict future positions from noisy mea- 
surements. The current movement model is viewed as 
the hidden state variable of a discrete HMM. The control 
dependent transition probabilities between states are ex- 
plicitly modeled as parametrized functions of the mea- 
surement. We call this model Markov Mixture of Experts 



(MME) and show how its parameters can be estimated. 
The rest of the paper is organized as follows: Section 2 
defines the problem, in section 3 the learning algorithm 
is derived, section 4 presents a simulated example and 
section 5 discusses other aspects relevant to the imple- 
mentation and directions for further work. 

2 Reachability Graphs and Markov 
Mixtures of Experts 

For any assembly of objects, the space of all the relative 
degrees of freedom of the objects in the assembly is called 
the configuration space (C-space). Every possible config- 
uration of the assembly is represented by a unique point 
in the C-space and movement in the real space maps into 
continuous trajectories in the C-space. The sets of points 
corresponding to each state of contact create a partition 
over the C-space. Because trajectories are continuous, a 
point can move from a state of contact only to one of the 
neighboring states of contact . This can be depicted by 
a directed graph with vertices representing states of con- 
tact and arcs for the possible transitions between them, 
called the reachability graph. If no constraints on the ve- 
locities are imposed, then in the reachability graph each 
state of contact is connected to all its neighbours. But if 
the range of velocities restricted, the connectivity of the 
graph decreases and the connections are generally non- 
symmetric. Figure 1 shows an example of a C-space and 
its reachability graph for velocities with only positive 
componenets. On the other hand, longer movement du- 
rations between observations could allow more than one 
state transition to occur, leading to an effective increase 
in the number of neighbours of a state. 

Ideally, in the absence of noise, the states of contact 
and every transition through the graph can be perfectly 
observed. To deal with the uncertainty in the measure- 
ments, we will attach probabilities to the arcs of the 
graph in the following way: Let us denote by Qi the set of 
configurations corresponding to state of contact i and let 
the movement of a point x with uniform nominal velocity 
v for a time AT be given by x(t + AT) = f*(x, v, AT); 
both x and v are vectors of same dimension as the C- 
space. Now, let x f , v f be the noisy measurements of the 
true values x, v, x E Qj and P[x, v\x f , t/, j] the poste- 
rior distribution of x, v given the measurements and the 
state of contact . Then, the probability of transition to a 
state i from a given state j in time AT can be expressed 
as: 



P[i\x\v\j] = / P[x,v\x f ,v f ,j]dx dv 

J{x,v\xeQj,f*(x,v,AT)eQi} 



The nominal velocity is the velocity of the arm under 
the same actuator force if it was moving in free space. It is 
proportional in size and identical in orientation to the force 
vector exerted by the actuators (the active force). When the 
object is subject to reactive forces from the surfaces it is in 
contact with, the actual velocity does not coincide with the 
nominal velocity. 



ij(x',v') 

(!) 
We define a transition probability matrix A — [ajz]™7=i 

and assume a measurement noise pfx'^ = i, x E Qi]. 
This allows us to construct a Hidden Markov Model 
(HMM) with output x having a continous emission prob- 
ability distribution p and where the state of contact plays 
the role of a hidden state variable. Our main goal is to 
estimate this model from observed data. 

To give a general statement of the problem we will as- 
sume that all position, velocity and force measurements 
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Figure 1: Restricted movement in a 2D space (a) and the corresponding reachability graph (b). The nodes represent 
movement models: C is the free space, A and B are surfaces with static and dynamic friction, G represents jamming 
in the corner. The velocity v has positive components. 



are represented by the input vector u; an output vec- 
tor y of dimensionality n y contains the future position 
(which our model will learn to predict). Observations 
are made at integer multiples of time AT, indexed by 
t = 0,1, ..T. If AT is a constant sampling time the 
dependency of the transition probability on AT can be 
ignored. For the purpose of the parameter estimation, 
the possible dependence between y(i) and u(i + 1) will 
also be ignored, but it should be considered when the 
trained model is used for prediction. 

Throughout the following section we will also make 
the assumption that the input-output dependence is de- 
scribed by a gaussian conditional density p{y{t)\q(t) — k) 
with mean /(u(t),0&) and variance E = a 2 I. This is 
equivalent to assuming that given the state of contact 
all noise is additive gaussian output noise, which is ob- 
viously an approximation. But this approximation will 
allow us to derive certain quantities in closed form. 

The function f(u,0k) is the movement model associ- 
ated with state of contact k (with 9 k its parameter vec- 
tor) and q is the selector variable which takes on values 
from 1 to m. Sometimes we will find it useful to parti- 
tion the domain of a movement model into subdomains 
and to represent it by a different function (i.e a different 
set of parameters 9 k ) on each of the subdomains; when 
f(u,0k) will bear physical significance, the name move- 
ment model will be to them, but in general each f(u, 9 k ) 
will be refered to as a module. 

The evolution of q is controlled by a Markov chain 
which depends on u and of a set of parameters Wj : 

aij (u(t), Wj) = Pr[q(t+1) = i\q(t) = j, u(t)} t = 0, 1, . . . 

with 
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Figure 2 depicts this architecture. It can be easily seen 
that this model generalizes the mixture of experts ar- 
chitecture [7], to which it reduces in the case where 0{j 
are independent of j (the columns of A are all equal). 
It becomes the model of [1] when A and / are neural 
networks. A model where the Markov chain transitions 
are implemented by a recurrent network was proposed 
in [3]. 

3 An EM Algorithm for MME 

values of the unknown parameters 
= 1, . . . , m given the sequence of ob- 
2/5-57, T > and a prior probability of 



x,-(0) = Pr[q(0) = j}. 

the Expectation Maximization (EM) [4] algorithm will be 
used. 

The EM algorithm is an iterative procedure which 
converges asymptotically to a local maximum of the like- 
lihood function. It requires the introduction of unob- 
served variables, which, in our case will be the hidden 
state variables {q(t)} t=Q . Then it attempts to maximize 
a complete likelihood function l c which depends on the 
extended set of variables. 

This process is iterative, each iteration comprising two 
steps. The first step (Expectation or E) finds the distri- 
bution of the unobserved variables given the observed 
variables and the current estimates of the parameters; 
then it computes the expectation of l c w.r.t. this distri- 
bution. 



]aij(u 3 Wj) = 1 j = 1, . . . , m Vi/, Wj 



(2) 



s-j — — = { s(ti), s(ti + 1), ... 5 5(^2)} denotes the sequence 
of values of the variable 5 over the time interval ti , . . . , £2 • 




Figure 2: The Markov Mixture of Experts architecture 



The Maximization (M) step reassigns to the unknown 
parameters the values which maximize the expected 
complete likelihood. 

For our problem the complete likelihood is: 
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and = {#i, m }. The first factor does not depend on 
and u and can be written as 
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Let us now introduce the notations 



M*) 



1 g(t)=_i 
otherwise 



7 ,-(<) = Pr[g(<) = i | « w , y^, W, 0, a 2 , tt(0)](6) 

^•(f) = Pr[g(t) = j, g(t + 1) = i | ujj, 

y^, W, 0, a 2 , tt(O)] 

for t = 0, . . . , T, i, j = 1, . . . , m. 

The last two quantities are well-known in the HMM liter- 
ature and can be computed efficiently by means of a pro- 
cedure which parallels the forward-backward algorithm 

[ii]. 

Thus, the complete likelihood l c becomes 
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t = fc = l 

Since this is a linear function of the unobserved 
variables represented by zK t)) taking its expectation is 

straightforward. Moreover, by the definitions (6), we 
have 

E[z\ (t) 1 2/oTr, u w , W, 0, tt(0)] = 7i (t) (8) 

E Kd) z kt+i) I yw> u o?> w > e > <W = & W ( Q ) 

for i, j = 1, . . . , m. 

In the M step the new estimates of the parameters are 
found by maximizing the average complete log-likelihood 
J, which in our case has the form 

J(0,a 2 ,W) = 



E[l c \y^r, u w , W, #, <r 2 ,7r(0)] (10) 

1 T m 

-o^££7*«ii!/(*)-/(«(*)A)ii 2 



t = k = l 



T-l m 

+ E E^-w ina o- («(*). ^-) (ii) 

t = zj=l 
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where C is a constant. Since each parameter appears in 
only one term of J the maximization is equivalent to: 
T 
e n k ew = argmin^ 7 *Wll!/W " /(«(*) A)l| 2 (12) 
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By solving (12)-(14) a new set of values for the param- 
eters is found and the current step of the EM procedure 
is completed. 

The difficulty of (12) and (13) strongly depends on the 
form of the movement models / and of the transition 
matrix A. The complexity of the movement models is 
determined by the geometrical shape of the objects' sur- 
faces. For planar surfaces and no rotational degrees of 
freedom / is linear in 6^. Then, (12) becomes a weighted 
least squares problem which can be solved in closed form. 

The functions in A depend both on the movement 
and on the noise models. Because the noise is prop- 
agated through non-linearities to the output, a closed 
form as in (1) may be difficult to obtain. Moreover, a 
correct noise model for each of the possible uncertainties 
is rarely available [5]. A common practical approach is 
to trade accuracy for computability and to parametrize 
A in a form which is easy to update but devoid of physi- 
cal meaning. In all the cases where maximization cannot 
be performed exactly, one can resort to Generalized EM 
by merely increasing J . In particular, gradient ascent in 
parameter space is a technique which can replace max- 
imization. This modification will not affect the overall 
convergence of the EM iteration but can significantly re- 
duce its speed. 

Because EM only finds local maxima of the likelihood, 
the initialization is important. If f(u,0k) correspond 
to physical movement models, good initial estimates for 
their parameters may be available. The same can apply 
to those components of W which bear physical signifi- 
cance. A complementary approach is to reduce the num- 
ber of parameters by explicitly setting the probabilities 
of the impossible transitions to 0. 

4 Simulation Results 

4.1 Problem and implementation 

The problem is to learn a predictive model for the move- 
ment of a point in the 2-dimensional space shown in fig- 
ure 1, a. The inputs were 4-dimensional vectors of posi- 
tions (x,y) and nominal velocities (v x ,v y ) and the out- 
put was the predicted position (x out, V out)- The coordi- 
nate range was [0, 10] and the admissible velocities were 
confined to the upper right quadrant (v x , v y > Vmin > 0) 
and had magnitudes (after multiplication by AT) be- 
tween and 4. The restriction in direction guaranteed 
that the trajectories remained in the coordinate domain. 
It also reflected in the topology of the reachability graph, 
which is not complete (has no transition to the free space 
from another state, for example). The magnitude range 
was chosen so that all the transitions in the graph have 
a non-negligible chance of being observed. A detailed 
description of the physical model is given in Appendix 
A. 

We implemented this model by a MME. The state 
q(t) represents the expert used up to time t and dij(t) 



the probability of using expert i between t and t-\-l given 
that q(t) = j. 

Implementation of the experts /: The ex- 
perts f(u,0k) were chosen to be linear in the pa- 
rameters, corresponding to the piecewise linearity of 
the true model (see Appendix A). Linearity is 
achieved by introducing the additional input variables 
v x /vy, v y /v x , xv y /v x , yv x /v y , each of them having a 
new parameter as coefficient. But each additional vari- 
able affects only one of the 6 experts (see Appendix A). 
Therefore, by including the additional variables only in 
the models that depend on them, we have an insignif- 
icant increase in the number of parameters. To avoid 
infinite input values the the values of v x , v y were lower- 
bounded by the small constant V m in — 1/50. 

Implementation of the transition probability 
matrix: To implement the transition matrix A we used 
a bank of gating networks, one for each column of A. Ex- 
amining figure 1 it is easy to see that there exist experts 
that share the same final state of contact (for instance, 
A stick and A slide both represent movements whose 
final position is on surface A). Since transition proba- 
bilities depend only on the final position the columns of 
the matrix A corresponding to these experts are equal. 
This brings the number of distinct gating networks to 4. 

The boundaries of the decision regions are curved sur- 
faces, so that to implement each of them we used a 2 layer 
perceptron with softmax output, as presented in figure 
3. The number of hidden units in each gating net was 
chosen considering the geometry of the decision regions 
to be learned. 

4.2 Training and testing criteria 

The training set consists of TV = 5000 data points, in 
sequences of length T < 6, all starting in the free space. 
The starting positon of the sequence and the nominal ve- 
locities at each step were picked randomly. It was found 
that for effective learning it is necessary that the state 
frequencies in the training set are roughly equal. The 
distribution over velocities and the sequence length T 
were chosen so as to meet this requirement. The (x, y) 
values obtained by simulation were corrupted with gaus- 
sian additive noise with standard deviation a. 

In the M step, the parameters of the gating networks 
were updated by gradient ascent, with a fixed number 
of epochs for each M step. Appendix C shows how the 
gradient was computed. We used weighted least squares 
estimation for the movement models parameters. To en- 
sure that models and gates are correctly coupled, initial 
values for are chosen around the true values. In the 
present case, this is not an unrealistic assumption. W 
was initialized with small random values. Because of 
the long time to convergence, only a small number of 
runs have been performed. The observed variance of the 
results over test sets was extremely small, so that the 
values presented here can be considered as typical. 

Three criteria were used to measure the performance 
of the learning algorithm: experts' parameters deviation 
from the true values, square root of prediction MSE and 
hidden state misclassificaton. Because training was per- 
formed on a distribution that is not expected to appear 
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Figure 3: Gating network computing the transition probabilities a^ for state of contact k. 
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in practice, all the models were tested on both the train- 
inng distribution and on a distribution which is uniform 
over (v x , v y ) (and therefore highly non-uniform over the 
states of contact), subsequently called the "uniform V 
distribution" . 

Performance comparisons were made with a ME ar- 
chitecture having identical experts but only one gating 
net. For the number of hidden units in it, various val- 
ues have been tried; the results presented here represent 
the performance of the best model obtained. The per- 
formance of a k-nearest neighbor (k-NN) model is also 
shown. 

4.3 Results 

Learning curves. Figure 4 presents the learning curves 
for two MME and two ME models. The horizontal axes 
show the number of EM epochs. Since the number of 
backpropagation epochs for each M step is 2000, the 
number of training epochs for the ME models was scaled 
down accordingly. The ME models converge faster than 
the MME models. The differrence may be due to the 
choice of a fixed number of epochs in the M step; the 
MME model also contains more parameters in the gat- 
ing nets (220 vs ME's 180 parameters). The experts 
contain 64 parameters. 

The only model which achieves misclassification er- 
ror is the MME trained without noise. For training in 
noise, the final prediction error on the training set is the 
same for both architectures, but the results on test sets 
will show a difference favouring the MME model. 

Comparison with k-NN. The k-NN method per- 
formed much worse than the other two algorithms. The 
following table presents a summary of the best results 
obtained. 

Test set performance. Figure 5 presents the test 
set performance of MME and ME models trained with 
and without noise. Each test set contained > 50,000 
datapoints. The input noise added to the (x, y) values 
in the test sets was gaussian with variance between 
and (0.4) 2 . It can be seen that the ME models perform 
similarly w.r.t. both prediction error and state classifi- 
cation. The two MME models perform better than the 



Table 1: Performance (MSE 1 / 2 ) for the k-NN 

method. Memorized (training) set: size=l 1,000; 
noise=<r m . Test set: size=l 1,000; noise=<Tt. 
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ME models, with a significant differrence between the 
model trained with noise and the model trained with 
ideal data. The results are consistent over distributions, 
noise levels and performance criteria. 

Parameter error. In the present problem the true 
parameters for the experts can be computed exactly. 
The values of the MSE of the learned parameters in var- 
ious models are shown in the following table (where a 
represents the noise level in the training set): 



Model 


a noise 


MSE* /2 


MME 
MME 
MME 




0.05 

0.2 




0.0474 

0.0817 


ME 
ME 



0.2 


0.1889 
0.1889 



Prediction along a trajectory. To illustrate the 
behaviour of the algorithm in closed loop, figure 6 
presents a sample trajectory and the predictions of the 
MME model in open and closed loop mode. In the lat- 
ter, the predicted and measured positions were averaged 
(with ratios 1/2) to provide the models' (x, y) input for 
the next time step. 

The simulations show that, although input noise is 
not explicitly taken into account by the model, the MME 
architecture is tolerant to it and is able to achieve both 
learning and good prediction performance in noisy con- 
ditions. 

The comparison with k-NN confirms that the prob- 
lem is not simple and that well tailored algorithms are 
required to solve it. MME outperforms ME in all situa- 
tions, with a small additional computational effort dur- 
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Figure 4: Prediction standard error (MSE 1 ' 2 ) (a) and percentage of wrong expert choices (misclassified states) (b) 
on the training set during learning for the MME and ME models. The abscissa was scaled according to the number 

of backpropagation epochs for the two models. — MME, no noise; • • • MME, noise 0.2; - • - ME, no noise; ME, 

noise 0.2. 
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Figure 5: Test set performance of the MME and ME models for various levels of the input noise <r zn on two different 
distribuitons. (a) and (b): (Prediction MSE) 1 ' 2 — <Ji n ; (c) and (d): the percentage of misclassified states, x - MME, 
no noise; * - MME noise 0.2; o - ME, no noise; + - ME, noise 0.2. 
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Figure 6: Sample trajectory in the (x, y) space, together with the open and closed loop predictions of the MME 
model. The noise level is 0.2. 



ing training. This demonstrates the advantage of taking 
into account the time-dependencies in the data. 

5 Discussion 

The learning algorithm and the architecture presented 
here allow us to model a wide range of dynamic systems. 
Theoretically any finite state discrete time system can 
be represented in the form of a MME, and continuous 
state spaces can be often conveniently discretized. But 
the aim of using a MME is to take advantage of the 
natural modularity of the process we are trying to model 
and to divide it into subprocesses of significantly lower 
complexity. 

In the special case when the MME reduces to a mix- 
ture of experts, the probability of a state depends only 
on the input u, and it is easy to see (also cf. [7]) that 
the gating network partitions the input space, assigning 
to each expert a subset of it. As the domains of the 
experts are generally contiguous, the mixture of experts 
can be viewed as a combination of local models, each 
being accurate in a confined region of the input space. 

From the same point of view, the MME divides the 
data into regimes and constructs locally accurate models 
of dynamic processes corresponding to each of them. The 
partition performed is twofold: the dynamic component 
is confined to the level of the Markov chain and, at the 
same level, the set of input-output pairs is partitioned 
into subsets which are assigned to the static experts. 
The former is ensured by the structure of the model, the 
latter has to be learned during the E step of the EM 
algorithm. 

This suggests that a critical condition for accurate 
learning is the correct assignment of the data points to 
the experts responsible for them. This is basically a clus- 
tering problem. Therefore, although the learner is pro- 
vided with "input-output" pairs, it is esssentially per- 
forming an alternation of supervised and unsupervised 
learning tasks. 



5.1 Computational aspects 

Local maxima. The EM algorithm is guaranteed to 
converge to a local maximum of the likelihood. This 
may not be sufficient when the value of the likelihood 
at the local maximum is much smaller than the value 
at the global maximum. Another issue is that in some 
cases we may be in search of parameters with physical 
meaning. In these cases, we need to find the (local) 
maximum that is closest to the parameters of the data 
generating process. 

The first issue can be addressed by performing sev- 
eral runs of the learning algorithm, each starting from 
a different initial point in the parameter space, and by 
chosing at the end the most likely solution. To address 
the second issue, besides validating the values of the pa- 
rameters after convergence, prior knowledge about the 
process can be used to find a good initial estimate. 

Influence of / and a^ . We have imposed no condi- 
tions on the form of / and aij so far. They influence the 
success of learning in several ways: 

1. they determine the difficulty of the Maximizaton 
step, as discussed in 3. 

2. / influences the difficulty of the clustering problem; 
a linear / reduces it to gaussian clustering, whereas if the 
class of allowed I/O functions becomes too rich or if the 
mapping 9 — ► /(•, 0) is not one-to-one the assignment 
problem can become ambiguous. 

3. the form of / and aij determines the number of 
parameters 3 and thereby the number of data points nec- 
essary for accurate learning. 

4. by 2. and 3. the shape of the likelihood function is 
influenced and subsequently the number and distribution 
of the local maxima 

Number of states. So far we have assumed that 
m, the number of experts, is known. This is not always 
the case in practice. Simulations suggest that starting 
with a large enough m could be a satisfactory strat- 
egy. Sometimes the superfluous clusters are automati- 



More precisely, the complexity of the model. 



cally "voided" of data points in the process of learning. 
This can be attributed to the fact that, for any fixed 
model structure, maximizing likelihood is equivalent to 
minimizing the description length of the data [12]. 

On the other hand, the number of parameters in A in- 
creases proportionally to m 2 , increasing both the compu- 
tational burden and the data complexity. The clustering 
becomes also harder for a large number of alternatives. 
Therefore it is important to know m or to have a good 
upper bound on it. 

Data complexity. The data complexity of a model 
is, loosely speaking, the number of data points required 
to attain a certain level of the prediction error. In the 
case where we can talk about true parameters, the pre- 
diction error is reflected by the accuracy of the parame- 
ters. For a more rigorous and detailed discussion of data 
complexity consult [6] and its references. For the pur- 
poses of this paper it is sufficient to state that for a given 
class of models, the data complexity increases with the 
number of parameters. As a consequence, an increase 
in m will induce a quadratic increase in the number of 
parameters and in the training set as well. 

Simulations have shown that training this architecture 
requires large amounts of data. Therefore it is necessary 
to use prior knowledge to reduce the number of parame- 
ters. This and other uses of knowledge outside the data 
set for constraining the model are the topics of the next 
subsection. 

5.2 Incorporating prior knowledge 

Prior knowledge is extremely valuable in any practical 
problem. Although the ML paradigm doesn't provide a 
systematic way of incorporating it, prior knowledge can 
and should be used at several levels of the model. 

First, it will help in finding the appropriate model 
structure. An important structural parameter is the 
number of states of the Markov chain. For instance, 
knowing the number of the movement models in the fine 
motion task has allowed us to pair each expert with 
a movement model, thus having simple linear experts 
whose parameters have a physical meaning. For systems 
that are not intrinsically discrete-time, prior knowledge 
should guide the choice for an appropriate discretization 
step. A too long discretization step may miss transitions 
between states leading thus to poor modelling, whereas 
a too short one will cause unnecesarily low transition 
probabilities between different states. 

Closely related to the choice of the number of states 
is the choice of the functional form of the experts. This 
has been shown in the modeling of the robot arm motion. 
Prior knowledge should help find the class of functions 
/ with the suitable complexity. This isssue, as well as 
the issue of the best representation for the inputs and 
outputs, arise in any modeling problem. 

A similar selection concerns the form of the functions 
in A. In view of condition (2) it is often reasonable to 
group the transition probabilities into gating networks as 
we have done in section 4 thus having a vector of param- 
eters Wj for each set of functions {aij \i = 1, . . . , m}. If 
it is known a priori that certain transitions cannot occur, 
then their probabilities can be set to 0, thereby reducing 



both the complexity of the gating network and speeding 
up the learning process by reducing uncertainty in the 
state estimation. For instance, in the fine motion prob- 
lem, when the state-space grows by adding new objects 
to the environment, the number of possible transitions 
from a state of contact (also called the branching factor) 
will be limited by the number of the states of contact 
which are within a certain distance and can be reached 
by a simple motion, which will grow much slowlier than 
m. Therefore, the number of nonzero elements in A will 
be approximately proportional to m. This is generally 
the case when transitions are local, providing another 
reason why locality is useful. 

Finally, when the experts have physical meaning, 
prior knowledge should give "good" initial estimates for 
the experts parameters. This is important not only to 
decrease the learning time, but also to avoid convergence 
to spurious maxima of the likelihood. 

Relearning. Certain aspects of relearning have been 
discussed at the end of the previous section. (By relearn- 
ing we mean learning a set of parameters, under the 
assumption that another learning problem has already 
been solved and that the old model is identical in struc- 
ture to the new one. Sometimes it is also assumed that 
the old and new parameter values are "close".) Since 
EM is a batch learning algorithm, it cannot automat- 
ically adapt to changes in the parameters of the data 
generating process. But knowledge exterior to the data 
set (that was called prior knowledge in the above para- 
graphs) can still be used to make relearning easier than 
learning from scratch. This process can be viewed as 
the reverse of incorporating prior knowledge: we must 
now (selectively) discard prior knowledge before learn- 
ing from the new data begins. Obviously, the model 
structure remains unchanged; the question is whether to 
keep some of the parameter values unchanged as well. 
The answer is yes in the case when we know that only 
some of the experts or gating nets have suffered changes, 
in other words when the change was local. In this case 
the learning algorithm can be easily adapted to locally 
relearn, i.e. to reestimate only a subset of the parame- 
ters. 

6 Conclusion 

To conclude, the MME architecture is a generalization 
of both HMMs and the mixture of experts architecture. 
From the latter it inherits the abilty to take advantage of 
the natural decomposability of the processes to model, 
when this is the case, or to construct complex models 
by putting together simple local models otherwise. The 
embedded Markov chain allows it to account for simple 
dynamics in the data generation process. 

The parameters of the MME can be estimated by an 
iterative algorithm which is a special case of the EM al- 
gorithm. As a consequence, the a posteriori : probabilities 
of the hidden states are also computed. Multiple aspects 
pertaining to the implementation of the algorithm have 
been discussed. The trained model can be used as a 
state estimator or for prediction as the forward model in 
a recursive estimation algorithm. 
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A The exact movement model for the 
simulated example 

Here the equations describing the exact movement model 
of the point in the C-space shown in fig. 1 are given. First, 
some notation. The position of the point at the current 
time is denoted by (x,y) and its position after moving 
for a time AT with constant velocity vector (v x ,v y ) by 
(x new ,y new ). 

The static and dynamic friction coefficients on sur- 
faces A and B are fi s A , fi A , fi s B , fi B ; we will consider that 
x E [0, LX], y E [0, LY]. With the movement model 
notation introduced in figure 1 the movement equations 
are: 

C: 



x 



x + v x AT 
y + v y AT 



(15) 



x 

y 

A stick 



= x + sgn(i^)sgn(i; y )/^(YL - y) 

,new V T 



+AT(v x - fi d A v y ) (16) 



x + v x 



YL 



YL-y 



(17) 

(18) 



B slide: 



XL 

y + sgn(v x )sgn(v y )fi d B (XL - x) (19) 

+AT(v y -fi d B v x ) 



B stick: 



XL 



y + v y : 



XL-x 



y 



XL 
YL 



(20) 
(21) 

(22) 



In the above, sgn denotes the sign function 

sgn(z) = 



1, z> 

-1, z<0 
0, z = 0. 



Equations (16-21) are slightly more general than needed 
for the present case. For velocities restricted to v x , v y > 
Vmin > the sgn functions are 1 and can be dropped, 
yielding all equations but (17 - 21) linear. 

The former can also be brought to a linear form by 
introducing the input variables 

Vx/vy, v y /v x , (x.v y )/v x , (y.v x )/v y . The coefficients of 
the extended set of variables are denoted by 0^ , . . .0^ 
and represent the parameters of the model to be learned. 
The coefficients corresponding to v x /v y , v y /v x , etc. are 
considered to be and are not subject to learning for 
the equations which don't contain these variables. 

The domain of each movement model is the region in 
the 4-dimensional space where the given model is valid. 
This region is easily defined in terms of a set of inequal- 
ities which must be satisfied by all its points. For exam- 
ple, the (preimage of) free space C is the set of points 
for which < x new < XL, < y new < YL. If we define 
the functions: 



dl(x,y,v x ,v y ) 
d2(x,y,v x ,v y ) 



x + v x AT 
y + VyAT 



(23) 



the following is an equivalent definition for the domain 
of C : 

< dl <XL 

< d2 <YL 



but now the functions dl and d2 can be used in inequali- 
ties defining the domains of neighboring movement mod- 
els. The complete list of functions which define the do- 
main boundaries follows. 



dl 
d2 
d3 
dA 
db 
d6 
dl 



x -\- v x AT 
y-\-v y AT 



i YL-y 






XL- 



x + sgn(v x ) sgn(v y ) iJ d A (Y L - 

+AT(v x - n d A v y ) 

y + sgn(^) sgn(v y )fi d B (XL 

+AT(v y - fi d B v x ) 



y) 

■ x) 



In addition, for all the movement models hold the in- 
equalities: 



mm 
M 



< x < 
< y < 

n _^ V x 

< V v 



XL 
YL 



< V m 



It can be seen that the model boundaries are not linear 
in the inputs. Even if made linear by augmenting the 
set of inputs as above, some domains are not convex sets 
and others, although convex, cannot be represented as 
the result of a tesselation with softmaxed hyperplanes. 

B The sigmoid and softmax functions 

The definition of the sigmoid function used in the present 
example is 



sigmoid^) = h(x) — tanh(x) 



(24) 



where x is a real scalar. The sigmoid function maps the 
real line into the interval (-1,1). Its derivative w.r.t. the 
variable x can be expressed as 

ti(x) = l-h 2 (x) 

The softmax function is the multivariate analog of the 
sigmoid. It is defined on lZ n — » lZ m by: 



softmax z (x, W) = gi(x, W) 



exp(W?x) 
£ ? exp(^)' 



i — 1, ..m 



with W, x vectors of the same dimension.) If we make 
the notation s z - = W?x, then the partial derivatives of 
the softmax function are given by 



dgj 

dsj 



9i($ij ~9j) Vi,i (25) 

where 6{j represents Kronecker's symbol. 

C Gradient calculations 

Here we give the calculation of the derivatives used in 
the generalized M-step of the EM algorithm. 



To obtain the new iterates of the parameters of the 
gating networks, one has to maximize the function given 
by (13) w.r.t. to W 



T-l 



Jw = EE^W ln ^W f )'^)) 



t = ij 

In the present implementation, the functions a z -j are 
computed by a two layer perceptron with softmax out- 
put. The complete calculation is given below, and the 
notation is explained by figure 3. 

a ik (u(i),W) = a ik (t) - softmax z (/7 fc , Wa fc ) 

h kj (t) = sigmoid[Wx k jx(t) + Wy kj y(t) + Wvx kj v x (t) 
+Wvy fcj .T, y (*)+Wt fcj -] 

where k = 1, .., m, i = 1, .., m f k , j = 1, ..., n k + 1 and 
m, m! , n k represent the number of states, of outputs and 
of hidden units for gate k, respectively. The parameter 
matrices and vectors Wx, Wy, Wvx, Wvy, Wt and Wa 
have the appropriate dimensions. By f — » k we denoted 
the fact that module f leads to state of contact k. This 
notation is necessary because there can be more than 
one module leading to a given state of contact . (For 
example, modules A stick and A slide both lead to 
contact with surface A). The last unit in each hidden 
layer is fixed to 1 to produce the effect of a threshold in 

$k- 

Expressing the derivatives as in Appendix B and usig 
the fact that 



!>>(*) = 7iW Vi,j 



we obtain the following formulas for the partial deriva- 
tives of Jw • 



d,h 



w 



dWa 



kij 



y^ dJw da Ik (t) 



Mk(t) dWa ki 



tj 



dJ 



w 



V" — 3 "* ,., ai k (i)(6u - a ik (t))h k j(t) 

dJ w da Ik (t) dh kj (t) 



dWa 



kj 



= E 



'-►fc 



u da ]k {t) dh kj (t) dWx kj 

Y J K 3 ^)x(t)Y J ^^r 

t i 

J2 Zij>(t)-aik(t) Y, Tj'W 



'-►fc 



'-►fc 



Notice the bracketed term that is common to the two 
expressions. Assuming, for simplicity, that j f = k, it can 
be rewritten as 



£ik(t) ~ a>ik(t)jk(t) = Jk(t) 
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&k(t) 



ik(t) 



<*>ik(t) 



input 
layer 



VN/vx^ 



hidden 
layer 



output 
layer 



I 



^9i V-i^, 



EI 


X 

OS 

E 

*i 
o 

CO 


g kj ' 



6, 



37 

Figure 7: An alternate network for the transition probabilities. Gkij> is 1 if f — » i and otherwise. 



outlining its role of weighted output error of the gating 
net. 

The modules / are linear in the parameters, therefore 
their derivatives computation is straightforward. 

Now we will show why the initial implementation is 
not fit for gradient ascent. Remember that in this repre- 
sentation the states of the Markov chain stood for states 
of contact and a^it) — P[q(t + 1) = i\q(t) = k,u(t)]. 
By gkj'(t) we denote the probability of using movement 
model f in the time interval (t, t-\- 1] given that the state 
of contact at time t was k. We have that 



Q>ik 



Z2 fJk i' ~ z2 Gki J 



'9kj> 



The structure of the network is given in figure 7. The 
elements of the matrix G are 1 or 0, indicating whether 
module j f leads to state of contact i or not. 

The function to be maximized has the same form as 
above, with the indices i, j both running now over the 
states of contact . Then its gradient w.r.t. g^j' is 



dJw 

dgjcj> 



&ik 



where i is the state of contact to which movement model 
j f leads (i.e. j f — » i). The above relationship shows that 
all the gates' outputs corresponding to the same state 
of contact receive the same error signal and therefore no 
competition can occur between them. 
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